Statistics with Python

I wrote the following code snippets while reading the book "Think Stats" by Allen B. Downey which aim is given by its subtitle "Probability and Statistics for Programmers". Initially this notebook served for me as a playground to reproduce various topics of the book. Now it's a quick reminder of statistical concepts and syntax of tools for it.



In [3]:

    
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil
from scipy.stats import pareto
import scipy.stats
import random as rnd



In [4]:

    
%pylab inline









    



Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.



In [5]:

    
a, m = 1.7, 150.
s = Series(np.random.pareto(a, 10000000) + m)
print("%f < x < %f" % (min(s),max(s)))









    



150.000000 < x < 12014.795739



In [6]:

    
s.describe()









    Out[6]:





count    10000000.000000
mean          151.427570
std             9.566633
min           150.000000
25%           150.184620
50%           150.503956
75%           151.261504
max         12014.795739
dtype: float64



In [7]:

    
count, bins, ignored = plt.hist(s, 100, normed=True)
fit = a*m**a/bins**(a+1)
plt.xlim([0,1000])
plt.plot(bins, max(count)*fit/max(fit),linewidth=2, color='r')
plt.show()



In [8]:

    
rv = pareto([.5,] * scipy.stats.pareto.numargs)

Different Probability Functions:



In [51]:

    
s_w = Series(np.random.weibull(2.5,100000))
s_n = Series(np.random.randn(100000))



In [52]:

    
hist_w = np.histogram(s_w,bins=100)
hist_n = np.histogram(s_n,bins=100)



In [53]:

    
plot(hist_w[0],'r',hist_n[0],'b')









    Out[53]:





[<matplotlib.lines.Line2D at 0x1130eb6d0>,
 <matplotlib.lines.Line2D at 0x1130eb950>]



In [56]:

    
# Pearson skewness is better for real world data as it's more robust against outlieres:
def pearson_skew(arr):
    return 3*(arr.mean()-arr.median())/arr.var()

print("mean: %f, median: %f, skew: %f, Pearson skewness: %f" % (s_w.mean(), s_w.median(), scipy.stats.skew(s_w), pearson_skew(s_w)))









    



mean: 0.886366, median: 0.863186, skew: 0.356908, Pearson skewness: 0.481884



In [6]:

    
plt.bar(hist_n[1][:-1],hist_n[0],width=.1)









    Out[6]:





<Container object of 100 artists>



In [7]:

    
CDF_w = Series(hist_w[0]).cumsum()
CDF_n = Series(hist_n[0]).cumsum()
plot(CDF_w,'r',CDF_n,'b')









    Out[7]:





[<matplotlib.lines.Line2D at 0x1083e8810>,
 <matplotlib.lines.Line2D at 0x1083e8dd0>]



In [172]:

    
CDF_w.sum()









    Out[172]:





9219905



In [14]:

    
nrm = scipy.stats.norm(100, 15)



In [15]:

    
plot(nrm.pdf(np.arange(-10,10,.1)),'r', nrm.cdf(np.arange(-10,10,.1)),'b')









    Out[15]:





[<matplotlib.lines.Line2D at 0x108a9f3d0>,
 <matplotlib.lines.Line2D at 0x108a9f650>]



In [195]:

    
6*10**9*nrm.pdf(190)









    Out[195]:





2.4303531399293141



In [203]:

    
n = 4
std([max(random.normal(950,50,n)) for _ in range(365)])









    Out[203]:





35.350982073471727



In [11]:

    
#plot(random.normal(950,50,365),'g',random.normal(1002,35,365),'r')
norm950 = scipy.stats.norm(950, 50)
norm1000 = scipy.stats.norm(1002, 35)
xs = np.arange(800,1200)
plot(xs,norm950.pdf(xs),'r', xs,norm1000.pdf(xs),'b')









    Out[11]:





[<matplotlib.lines.Line2D at 0x108a6f490>,
 <matplotlib.lines.Line2D at 0x108a6f6d0>]



In [233]:

    
flipping_15_coins = scipy.stats.binom(15,.5)
flipping_15_coins.pmf(10) # propability that 10 flips yield head









    Out[233]:





0.091644287109375097



In [17]:

    
# creating a 
import scipy.stats as stats
x = np.linspace(0, 1, 100)
plot(x,stats.beta.pdf(x,3,5),x,stats.beta.pdf(x,5,5),x,stats.beta.pdf(x,5,3),x,stats.beta.pdf(x,1,5),x,stats.beta.pdf(x,5,1))









    Out[17]:





[<matplotlib.lines.Line2D at 0x108fd0890>,
 <matplotlib.lines.Line2D at 0x108fd0ad0>,
 <matplotlib.lines.Line2D at 0x108fd4090>,
 <matplotlib.lines.Line2D at 0x108fd4590>,
 <matplotlib.lines.Line2D at 0x108fd4a90>]



In [74]:

    
ser_twonormals = Series(random.normal(10,1,size=(10000)) + random.normal(50,1,size=(10000)))
ser_twonormals.hist(bins=100)
ser_twonormals.describe()









    Out[74]:





count    10000.000000
mean        59.999976
std          1.428755
min         54.414580
25%         59.036657
50%         59.987147
75%         60.971587
max         66.337055
dtype: float64



In [75]:

    
ser_twonormals = Series(np.concatenate([random.normal(10,1,size=(10000)), random.normal(50,1,size=(10000))]))
ser_twonormals.hist(bins=100)
ser_twonormals.describe()









    Out[75]:





count    20000.000000
mean        29.995583
std         20.030437
min          6.066369
25%          9.990733
50%         29.506300
75%         49.998532
max         53.511374
dtype: float64

Hypothesis Testing



In [91]:

    
cd /Users/florianhoppe/Google Drive/Wissen/Books/thinkstats/thinkstats_resources









    



/Users/florianhoppe/Google Drive/Wissen/Books/thinkstats/thinkstats_resources



In [112]:

    
import survey_using_panda
preg = Pregnancies()
preg.ReadRecords()
#len(preg.records)









    Out[112]:





13593



In [166]:

    
preg_columns = ['caseid','nbrnaliv','babysex','birthwgt_lb','birthwgt_oz','prglength','outcome','birthord','agepreg','finalwgt']
data = []
for record in preg.records:
    data += [[getattr(record,field) for field in preg_columns]]

data = DataFrame(data,columns=preg_columns)
print(data[:3])









    



   caseid nbrnaliv babysex birthwgt_lb birthwgt_oz  prglength  outcome  \
0       1        1       1           8          13         39        1   
1       1        1       2           7          14         39        1   
2       2        3       1           9           2         39        1   

  birthord agepreg      finalwgt  
0        1   33.16   6448.271112  
1        2   39.25   6448.271112  
2        1   14.33  12999.542264



In [322]:

    
#data.birthord[data.birthord=='NA'] = np.NAN
data.birthord.value_counts()









    Out[322]:





NA    4445
1     4413
2     2874
3     1234
4      421
5      126
6       50
7       20
8        7
9        2
10       1
dtype: int64



In [168]:

    
reported_birth = data.ix[data.birthord!='NA',:]



In [176]:

    
first_born = data.ix[data.birthord==1,:]
others = data.ix[(data.birthord!=1) & (data.birthord!='NA'),:]



In [327]:

    
field_to_check = 'finalwgt'#'prglength'
diff_of_means = getattr(first_born,field_to_check).mean()-getattr(others,field_to_check).mean()
# resampling
n = 1000
fb_samples_prglength = getattr(first_born,field_to_check).ix[first_born.index[rnd.sample(xrange(len(first_born)),n)]]
o_samples_prglength = getattr(others,field_to_check).ix[others.index[rnd.sample(xrange(len(others)),n)]]
diff_of_samples = Series(np.array(fb_samples_prglength.astype(float)) - np.array(o_samples_prglength.astype(float)))
diff_of_samples.hist(bins=100)
diff_of_samples.describe()
print("In %.0f %% cases the difference of the field %s is larger then the mean difference %f." % (float(100.0*(diff_of_samples>diff_of_means).sum())/n,field_to_check,diff_of_means))









    



In 57 % cases the difference of the field finalwgt is larger then the mean difference -503.481008.



In [333]:

    
first_born[['prglength','finalwgt']].describe()









    Out[333]:






  
    
      
      prglength
      finalwgt
    
  
  
    
      count
       4413.000000
         4413.000000
    
    
      mean
         38.600952
         8143.762911
    
    
      std
          2.791901
         8270.404870
    
    
      min
          0.000000
          118.656790
    
    
      25%
         39.000000
         4033.606004
    
    
      50%
         39.000000
         6456.845037
    
    
      75%
         40.000000
         9496.575715
    
    
      max
         48.000000
       261879.953864



In [329]:

    
others[['prglength','finalwgt']].describe()









    Out[329]:






  
    
      
      prglength
      finalwgt
    
  
  
    
      count
       4735.000000
         4735.000000
    
    
      mean
         38.522914
         8647.243919
    
    
      std
          2.615852
         9880.977442
    
    
      min
          4.000000
          374.737005
    
    
      25%
         39.000000
         3963.834554
    
    
      50%
         39.000000
         6365.031570
    
    
      75%
         39.000000
         9598.147669
    
    
      max
         50.000000
       261879.953864



In [331]:

    
first_born[['prglength','finalwgt']].describe()-others[['prglength','finalwgt']].describe()









    Out[331]:






  
    
      
      prglength
      finalwgt
    
  
  
    
      count
      -322.000000
       -322.000000
    
    
      mean
         0.078037
       -503.481008
    
    
      std
         0.176049
      -1610.572572
    
    
      min
        -4.000000
       -256.080216
    
    
      25%
         0.000000
         69.771450
    
    
      50%
         0.000000
         91.813467
    
    
      75%
         1.000000
       -101.571954
    
    
      max
        -2.000000
          0.000000

Poisson Distribution



In [31]:

    
def tp():
    ser = Series(np.random.poisson(10,1000))
    return ser[ser<=10].count()

mean([tp() for _ in range(100)])









    Out[31]:





582.76999999999998



In [3]:

    
ser = Series(np.random.poisson(10,1000))



In [35]:

    
ser.hist()









    Out[35]:





<matplotlib.axes.AxesSubplot at 0x107f40750>



In [37]:

    
# it's right-side skewed
scipy.stats.skew(ser)









    Out[37]:





0.3148389791854863



In [70]:

    
p_ser = []
for val in ser.unique():
    p_of_val = ser[ser==val].count()/float(len(ser))
    p_ser += [p_of_val]
    
plt.bar(list(ser.unique()),p_ser)









    Out[70]:





<Container object of 20 artists>



In [63]:

    
ser.hist(bins=len(ser.unique()))









    Out[63]:





<matplotlib.axes.AxesSubplot at 0x109757650>



In [43]:

    
rv = scipy.stats.poisson(10)
rv.cdf(10)









    Out[43]:





0.58303975019298493



In [92]:

    
lam = 10
n = 10000
ser = Series(np.random.poisson(lam,n))
ser.describe()









    Out[92]:





count    10000.000000
mean        10.071900
std          3.188189
min          1.000000
25%          8.000000
50%         10.000000
75%         12.000000
max         24.000000
dtype: float64



In [51]:

    
p_below_mean = ser[ser<=ser.mean()].count().astype(float)/len(ser)
print("Percent of the samples are below the mean: %.0f %%" % (p_below_mean*100))
print("This equals (by definition) the value of the cummultative density function of the same poisson distribution: %.0f %%" % (100*scipy.stats.poisson(lam).cdf(lam)))
print("And the inverse of the mass function of that percentage is again the lambda: %.1f" % scipy.stats.poisson.ppf(p_below_mean,lam))









    



Percent of the samples are below the mean: 58 %
This equals (by definition) the value of the cummultative density function of the same poisson distribution: 58 %
And the inverse of the mass function of that percentage is again the lambda: 10.0



In [109]:

    
scipy.stats.poisson.ppf(scipy.stats.poisson(lam).cdf(lam),lam) == lam









    Out[109]:





True



In [99]:

    
sorted_ser = np.sort(ser)
plot(arange(0,1,.01),scipy.stats.poisson.ppf(arange(0,1,.01),lam),'rx',arange(0,1,1.0/len(sorted_ser)),sorted_ser,'b')









    Out[99]:





[<matplotlib.lines.Line2D at 0x1097c2c90>,
 <matplotlib.lines.Line2D at 0x1097c7210>]



In [100]:

    
scipy.stats.poisson.ppf(.2,lam)









    Out[100]:





7.0



In [102]:

    
# after sorting the index (id sorted_ser[i] or sorted_ser.ix[i]) still references the old order:
sorted_ser.values[.2 * len(sorted_ser)]









    Out[102]:





7



In [142]:

    
dist_percentage = 0.97
lowb, highb = scipy.stats.poisson.interval(dist_percentage,lam)
print("To get at least %.0f %% of a poisson distribution with lambda %i the given event must be between %i and %i." % (dist_percentage,lam,lowb,highb))
print("For this interval the sample series contains %.1f %% events" % (ser[(ser>=lowb) & (ser<=highb)].count()*100.0/len(ser)))









    



To get 1 % of a poisson distribution with lambda 10 the given event must be between 4 and 17.
For this interval the sample series contains 97.4 % events

Hypothesis Testing



In [70]:

    
# testing if the number of occurences matches expected ones:

alpha = 0.05
observed = np.array([126,174]) # absolut value of occurences of two events
expected = np.array([3.0/7, 4.0/7]) * observed.sum() # expected are here primarily given as percents/ration
X_squared, p_value = scipy.stats.chisquare(observed,expected)
print("The observed occurences have a propability of %.1f %%. The X^2 difference between observed and expected is %.2f." % (100*p_value, X_squared))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))









    



The observed occurences have a propability of 76.4 %. The X^2 difference between observed and expected is 0.09.
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is accepted



In [69]:

    
# t test: testing how a observed mean relates to a assumed (== true) mean ie. if it's equal, greater or less
# TODO add assumptions on distributions and ways to performe less and greater tests

true_mean = 10
alpha = 0.05
ser = random.normal(true_mean,1,size=(100000))
err,p_value = scipy.stats.ttest_1samp(ser,true_mean)
print("Given the sample series it's %.1f %% likely that the mean of the underlying distribution is %.0f. Error is %.4f" % (p_value*100,true_mean,err))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))
test_mean = 9.9
err,p_value = scipy.stats.ttest_1samp(ser,test_mean)
print("Given the sample series it's %.1f %% likely that the mean of the underlying distribution is %.0f." % (p_value*100,test_mean))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))









    



Given the sample series it's 68.3 % likely that the mean of the underlying distribution is 10. Error is 0.4081
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is accepted
Given the sample series it's 0.0 % likely that the mean of the underlying distribution is 10.
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is rejected



In [ ]:

	prglength	finalwgt
count	4413.000000	4413.000000
mean	38.600952	8143.762911
std	2.791901	8270.404870
min	0.000000	118.656790
25%	39.000000	4033.606004
50%	39.000000	6456.845037
75%	40.000000	9496.575715
max	48.000000	261879.953864

	prglength	finalwgt
count	4735.000000	4735.000000
mean	38.522914	8647.243919
std	2.615852	9880.977442
min	4.000000	374.737005
25%	39.000000	3963.834554
50%	39.000000	6365.031570
75%	39.000000	9598.147669
max	50.000000	261879.953864

	prglength	finalwgt
count	-322.000000	-322.000000
mean	0.078037	-503.481008
std	0.176049	-1610.572572
min	-4.000000	-256.080216
25%	0.000000	69.771450
50%	0.000000	91.813467
75%	1.000000	-101.571954
max	-2.000000	0.000000